NOTE: This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
In the previous practice/session, you learned how to use local spatial statistics for exploratory spatial data analysis.
For this practice you will need the following:
The shape file includes spatial information for Traffic Analysis Zones (TAZ) in the Hamilton Census Metropolitan Area (as polygons).
In this practice, you will:
O’Sullivan D and Unwin D (2010) Geographic Information Analysis, 2nd Edition, Chapter 7. John Wiley & Sons: New Jersey.
As usual, it is good practice to clear the working space to make sure that you do not have extraneous items there when you begin your work. The command in R to clear the workspace is rm (for “remove”), followed by a list of items to be removed. To clear the workspace from all objects, do the following:
rm(list = ls())
Note that ls() lists all objects currently in the workspace.
Load the libraries you will use in this activity:
library(tidyverse)
-- Attaching packages --------------------------------------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.0.0 v purrr 0.2.5
v tibble 1.4.2 v dplyr 0.7.5
v tidyr 0.8.1 v stringr 1.3.1
v readr 1.1.1 v forcats 0.3.0
-- Conflicts ------------------------------------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
library(rgdal)
Loading required package: sp
rgdal: version: 1.3-2, (SVN revision 755)
Geospatial Data Abstraction Library extensions to R successfully loaded
Loaded GDAL runtime: GDAL 2.2.3, released 2017/11/20
Path to GDAL shared files: C:/Users/Antonio/Documents/R/win-library/3.4/rgdal/gdal
GDAL binary built with GEOS: TRUE
Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
Path to PROJ.4 shared files: C:/Users/Antonio/Documents/R/win-library/3.4/rgdal/proj
Linking to sp version: 1.3-1
library(broom)
library(spdep)
Loading required package: Matrix
Attaching package: ‘Matrix’
The following object is masked from ‘package:tidyr’:
expand
Loading required package: spData
library(reshape2)
Attaching package: ‘reshape2’
The following object is masked from ‘package:tidyr’:
smiths
library(plotly)
Attaching package: ‘plotly’
The following object is masked from ‘package:ggplot2’:
last_plot
The following object is masked from ‘package:stats’:
filter
The following object is masked from ‘package:graphics’:
layout
library(knitr)
library(kableExtra)
library(spgwr)
NOTE: This package does not constitute approval of GWR
as a method of spatial analysis; see example(gwr)
Begin by loading the shape file:
Hamilton_TAZ <- readOGR(".", layer = "Hamilton CMA tts06")
OGR data source with driver: ESRI Shapefile
Source: "C:\Antonio\Courses\GEOG 4GA3 - Applied Spatial Analysis\Spatial-Statistics-Course\15. Area Data VI\01. Readings and Practice", layer: "Hamilton CMA tts06"
with 297 features
It has 12 fields
Integer64 fields read as strings: ID NUM GTA06 GTA01 AREA_M AREA_H
The shape file includes the geometry of the zones only.
To use the plotting functions of ggplot2, the SpatialPolygonsDataFrame needs to be “tidied” by means of the tidy function of the broom package:
Hamilton_TAZ.t <- tidy(Hamilton_TAZ, region = "GTA06")
Hamilton_TAZ.t <- dplyr::rename(Hamilton_TAZ.t, GTA06 = id)
Tidying the spatial dataframe strips it of its non-spatial information, but we can add the data back by means of the left_join function:
Hamilton_TAZ.t <- left_join(Hamilton_TAZ.t, Hamilton_TAZ@data, by = "GTA06")
Column `GTA06` joining character vector and factor, coercing into character vector
Now the tidy dataframe Hamilton_TAZ.t contains the spatial information and the data.
You can quickly verify the contents of the dataframe by means of summary:
summary(Hamilton_TAZ.t)
long lat order hole piece group
Min. :-80.25 Min. :43.05 Min. : 1 Mode :logical 1:11772 4050.1 : 266
1st Qu.:-79.90 1st Qu.:43.21 1st Qu.: 2949 FALSE:11784 2: 16 4052.1 : 239
Median :-79.85 Median :43.25 Median : 5896 TRUE :8 3: 4 6007.1 : 239
Mean :-79.83 Mean :43.26 Mean : 5896 5211.1 : 226
3rd Qu.:-79.79 3rd Qu.:43.29 3rd Qu.: 8844 5191.1 : 191
Max. :-79.51 Max. :43.48 Max. :11792 6018.1 : 148
(Other):10483
GTA06 ID AREA NUM PD REGION
Length:11792 2299 : 266 Min. : 0.1083 0 :10560 Min. : 0.00 Min. : 5.000
Class :character 2301 : 239 1st Qu.: 1.0316 11007 : 239 1st Qu.: 0.00 1st Qu.: 6.000
Mode :character 583 : 239 Median : 1.7958 11018 : 148 Median : 0.00 Median : 6.000
2842 : 226 Mean : 6.1606 11008 : 127 Mean : 8.68 Mean : 6.305
2823 : 191 3rd Qu.: 4.1323 11006 : 107 3rd Qu.: 0.00 3rd Qu.: 6.000
594 : 148 Max. :90.7720 11012 : 107 Max. :40.00 Max. :11.000
(Other):10483 (Other): 504
GTA01 AREA_M AREA_H UTMX_CENT UTMY_CENT DISTRICT_N
0 :9104 0:11792 0:11792 Min. :0 Min. :0 11:1232
2050 : 266 1st Qu.:0 1st Qu.:0 5 :2559
2052 : 239 Median :0 Median :0 6 :8001
2556 : 110 Mean :0 Mean :0
2091 : 96 3rd Qu.:0 3rd Qu.:0
2098 : 63 Max. :0 Max. :0
(Other):1914
Previously you learned about the use of Moran’s I coefficient as a diagnostic in regression analysis.
Residual spatial autocorrelation is a symptom of a model that has not been properly specified. There are two reasons for this that are of interest: the functional form of the model may be incorrect, or a relevant variable may be missing.
Let's explore these in turn.
To illustrate this, we will simulate a spatial process as follows: \[ z_i = f(x_i,y_i) = \exp(\beta_0)\exp(\beta_1x_i)\exp(\beta_2y_i) + \epsilon_i \]
Clearly, this is a non-linear spatial process.
The simulation is as follows, with a random term with a mean of zero and standard deviation of 1. The random terms are independent by design:
set.seed(10)
b0 = 1
b1 = 2
b2 = 4
xy_coords <- coordinates(Hamilton_TAZ)
Hamilton_TAZ@data <- mutate(Hamilton_TAZ@data,
x = xy_coords[,1] - min(xy_coords[,1]),
y = xy_coords[,2] - min(xy_coords[,2]),
z = exp(b0) * exp(b1 * x) * exp(b2 * y) +
rnorm(n = 297, mean = 0, sd = 1))
summary(Hamilton_TAZ@data[,13:15])
x y z
Min. :0.0000 Min. :0.0000 Min. : 3.761
1st Qu.:0.2810 1st Qu.:0.1226 1st Qu.: 8.222
Median :0.3316 Median :0.1550 Median : 9.958
Mean :0.3348 Mean :0.1681 Mean :10.885
3rd Qu.:0.3850 3rd Qu.:0.1976 3rd Qu.:12.534
Max. :0.6518 Max. :0.3683 Max. :22.358
Suppose that we estimate the model as a linear regression that does not correctly capture the non-linearity. The model would be as follows:
model1 <- lm(formula = z ~ x + y, data = Hamilton_TAZ@data)
summary(model1)
Call:
lm(formula = z ~ x + y, data = Hamilton_TAZ@data)
Residuals:
Min 1Q Median 3Q Max
-2.73799 -0.85271 0.01062 0.77256 3.07456
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.2480 0.3169 -13.41 <2e-16 ***
x 21.9485 0.6798 32.29 <2e-16 ***
y 46.2989 1.0019 46.21 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.196 on 294 degrees of freedom
Multiple R-squared: 0.9014, Adjusted R-squared: 0.9007
F-statistic: 1343 on 2 and 294 DF, p-value: < 2.2e-16
At first glance, the model gives the impression of a very good fit: all coefficients are significant, and the coefficient of multiple determination \(R^2\) is very high.
At this point, it is important to examine the residuals to verify that they are independent. Let's add the residuals of this model to the dataframes:
Hamilton_TAZ@data$model1.e <- model1$residuals
Hamilton_TAZ.t <- left_join(Hamilton_TAZ.t,
data.frame(GTA06 = Hamilton_TAZ$GTA06, model1$residuals))
Joining, by = "GTA06"
Column `GTA06` joining character vector and factor, coercing into character vector
Hamilton_TAZ.t <- rename(Hamilton_TAZ.t, model1.e = model1.residuals)
A map of the residuals can help examine their spatial pattern:
map.e1 <- ggplot(data = Hamilton_TAZ.t, aes(x = long, y = lat, group = group,
fill = model1.e)) +
geom_polygon(color = "white") +
coord_equal() +
scale_fill_distiller(palette = "RdBu")
ggplotly(map.e1)
To test the residuals for spatial autocorrelation we first create a set of spatial weights:
Hamilton_TAZ.w <- nb2listw(poly2nb(Hamilton_TAZ))
With this, we can now calculate Moran’s I:
moran.test(Hamilton_TAZ$model1.e, Hamilton_TAZ.w)
Moran I test under randomisation
data: Hamilton_TAZ$model1.e
weights: Hamilton_TAZ.w
Moran I statistic standard deviate = 9.3976, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
0.317160954 -0.003378378 0.001163393
The test allows us to reject the null hypothesis of spatial independence: the residuals are spatially autocorrelated. Thus, despite the apparent goodness of fit of the model, there is reason to believe something is missing.
Let's now use a variable transformation to approximate the underlying non-linear process:
model2 <- lm(formula = log(z) ~ x + y, data = Hamilton_TAZ@data)
summary(model2)
Call:
lm(formula = log(z) ~ x + y, data = Hamilton_TAZ@data)
Residuals:
Min 1Q Median 3Q Max
-0.31731 -0.06268 0.00447 0.07483 0.28175
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.96804 0.02721 35.58 <2e-16 ***
x 2.07419 0.05837 35.53 <2e-16 ***
y 3.97461 0.08603 46.20 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1027 on 294 degrees of freedom
Multiple R-squared: 0.9066, Adjusted R-squared: 0.9059
F-statistic: 1426 on 2 and 294 DF, p-value: < 2.2e-16
This model does not necessarily have a better goodness of fit. However, when we test for spatial autocorrelation:
Hamilton_TAZ@data$model2.e <- model2$residuals
moran.test(Hamilton_TAZ$model2.e, Hamilton_TAZ.w)
Moran I test under randomisation
data: Hamilton_TAZ$model2.e
weights: Hamilton_TAZ.w
Moran I statistic standard deviate = 0.58286, p-value = 0.28
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
0.016485452 -0.003378378 0.001161455
Once the correct functional form has been specified, the model is better at capturing the underlying process (note how closely the estimated coefficients approximate the true coefficients of the model). In addition, we can conclude that the residuals are independent and therefore spatially random, meaning that there is nothing left of the process but white noise.
Using the same example, suppose now that the functional form is correctly specified, but a relevant variable is missing:
model3 <- lm(formula = log(z) ~ x, data = Hamilton_TAZ@data)
summary(model3)
Call:
lm(formula = log(z) ~ x, data = Hamilton_TAZ@data)
Residuals:
Min 1Q Median 3Q Max
-0.70236 -0.17279 -0.04807 0.13260 0.83999
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.81774 0.05753 31.60 <2e-16 ***
x 1.53226 0.16406 9.34 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2947 on 295 degrees of freedom
Multiple R-squared: 0.2282, Adjusted R-squared: 0.2256
F-statistic: 87.23 on 1 and 295 DF, p-value: < 2.2e-16
As before, let's append the residuals to the dataframes:
Hamilton_TAZ@data$model3.e <- model3$residuals
Hamilton_TAZ.t <- left_join(Hamilton_TAZ.t,
data.frame(GTA06 = Hamilton_TAZ$GTA06, model3$residuals))
Joining, by = "GTA06"
Column `GTA06` joining character vector and factor, coercing into character vector
Hamilton_TAZ.t <- rename(Hamilton_TAZ.t, model3.e = model3.residuals)
A map of the residuals can help examine their spatial pattern:
map.e3 <- ggplot(data = Hamilton_TAZ.t, aes(x = long, y = lat, group = group,
fill = model3.e)) +
geom_polygon(color = "white") +
coord_equal() +
scale_fill_distiller(palette = "RdBu")
ggplotly(map.e3)
In this case, the visual inspection makes it clear that there is an issue with spatially autocorrelated residuals, something that a test reinforces:
moran.test(Hamilton_TAZ$model3.e, Hamilton_TAZ.w)
Moran I test under randomisation
data: Hamilton_TAZ$model3.e
weights: Hamilton_TAZ.w
Moran I statistic standard deviate = 24.616, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
0.835679736 -0.003378378 0.001161885
As seen above, the model with the full set of relevant variables resolves this problem.
When spatial autocorrelation is detected in the residuals, further work is warranted. The preceding examples illustrate two possible solutions to the issue of residual pattern:
Ideally, we would try to ensure that the model is properly specified. In practice, however, it is not always evident what the functional form of the model should be. The search for an appropriate functional form can be guided by theoretical considerations, empirical findings, and experimentation. With respect to inclusion of relevant variables, it is not always possible to find all the information we desire. This could be because of limited resources, or because some aspects of the process are not known and therefore we do not even know what additional information should be collected.
In these cases, residual spatial autocorrelation remains a problem.
Fortunately, a number of approaches have been proposed in the literature that can be used for remedial action.
In the following sections we will review some of them.
Some models use variable transformations to create more flexible functions, while others use adaptive estimation strategies.
Trend surface analysis is a simple way to generate relatively flexible surfaces.
This approach consists of using the coordinates as covariates, and transforming them into polynomials of different orders. Seen this way, linear regression is the analog of a trend surface of first degree: \[ z = f(x,y) = \beta_0 + \beta_1x + \beta_2y \] where \(x\) and \(y\) are the coordinates.
A figure illustrates how the function above creates a regression plane. First, create a grid of coordinates for plotting:
df <- expand.grid(x = seq(from = -2, to = 2, by = 0.2), y = seq(from = -2, to = 2, by = 0.2))
Next, select some values for the coefficients (feel free to experiment with these values):
b0 <- 0.5 #0.5
b1 <- 1 #1
b2 <- 2 #2
z1 <- b0 + b1 * df$x + b2 * df$y
z1 <- matrix(z1, nrow = 21, ncol = 21)
The plot is as follows:
plot_ly(z = ~z1) %>% add_surface() %>%
layout(scene = list(xaxis = list(ticktext = c("-2", "0", "2"), tickvals = c(0, 10, 20)),
yaxis = list(ticktext = c("-2", "0", "2"), tickvals = c(0, 10, 20))
)
)
A trend surface of second degree, or quadratic, would be as follows. Notice how it includes all possible quadratic terms, including the product \(xy\): \[ z = f(x,y) = \beta_0 + \beta_1x^2 + \beta_2x + \beta_3xy + \beta_4y + \beta_5y^2 \]
Use the same grid as above to create now a regression surface. Select some coefficients:
b0 <- 0.5 #0.5
b1 <- 2 #2
b2 <- 1 #1
b3 <- 1 #1
b4 <- 1.5 #1.5
b5 <- 0.5 #0.5
z2 <- b0 + b1 * df$x^2 + b2 * df$x + b3 * df$x * df$y + b4 * df$y + b5 * df$y^2
z2 <- matrix(z2, nrow = 21, ncol = 21)
And the plot is as follows:
plot_ly(z = ~z2) %>% add_surface() %>%
layout(scene = list(xaxis = list(ticktext = c("-2", "0", "2"), tickvals = c(0, 10, 20)),
yaxis = list(ticktext = c("-2", "0", "2"), tickvals = c(0, 10, 20))
)
)
Higher order polynomials (i.e., cubic, quartic, etc.) are possible in principle. Something to keep in mind is that the higher the order of the polynomial, the more flexible the surface, which may lead to two issues: multicollinearity and overfitting.
Powers of variables tend to be highly correlated with each other. See the following table of correlations for the x coordinate in the example:
|     | x    | x^2  | x^3  | x^4  |
|-----|------|------|------|------|
| x   | 1.00 | 0.00 | 0.92 | 0.00 |
| x^2 | 0.00 | 1.00 | 0.00 | 0.96 |
| x^3 | 0.92 | 0.00 | 1.00 | 0.00 |
| x^4 | 0.00 | 0.96 | 0.00 | 1.00 |
When two variables are highly collinear, the model has difficulties discriminating their relative contribution to the model. This is manifested by inflated standard errors that may depress the significance of the coefficients, and occasionally by sign reversals.
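As a quick check, the correlations in the table above can be reproduced from the grid coordinates defined earlier (a sketch; only the x coordinate of the grid is used):

```r
# Correlations between powers of the x coordinate used in the plotting grid
x <- seq(from = -2, to = 2, by = 0.2)
powers <- cbind(x, x^2, x^3, x^4)
# Odd and even powers are uncorrelated over a symmetric range, while
# powers of the same parity are highly correlated
round(cor(powers), 2)
```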
Overfitting is another possible consequence of using a trend surface that is too flexible. This happens when a model fits the observations used for calibration too well, and as a consequence may fail to fit new information.
To illustrate overfitting consider a simple example. Below we simulate a simple linear model with \(y_i = x_i + \epsilon_i\) (the random terms are drawn from the uniform distribution). We also simulate new data using the exact same process:
# Dataset for estimation
df.of1 <- data.frame(x = seq(from = 1, to = 10, by = 1))
df.of1 <- mutate(df.of1, y = x + runif(10, -1, 1))
# New data
new_data <- data.frame(x = seq(from = 1, to = 10, by = 0.5))
df.of2 <- mutate(new_data, y = x + runif(nrow(new_data), -1, 1))
This is the scatterplot of the observations in the estimation dataset:
p <- ggplot(data = df.of1, aes(x = x, y = y))
p + geom_point(size = 3)
A model with a first order trend (essentially linear regression), does not fit the observations perfectly, but when confronted with new data (plotted as red squares), it predicts them with reasonable accuracy:
mod.of1 <- lm(formula = y ~ x, data = df.of1)
pred1 <- predict(mod.of1, newdata = new_data) #mod.of1$fitted.values
p + geom_abline(slope = mod.of1$coefficients[2], intercept = mod.of1$coefficients[1],
color = "blue", size = 1) +
geom_point(data = df.of2, aes(x = x, y = y), shape = 0, color = "red") +
geom_segment(data = df.of2, aes(xend = x, yend = pred1)) +
geom_point(size = 3) +
xlim(c(1, 10))
Compare to a polynomial of very high degree (nine in this case). The model is much more flexible, to the extent that it perfectly matches the observations in the estimation dataset. However, this flexibility has a downside. When the model is confronted with new information, its performance is less satisfactory.
mod.of2 <- lm(formula = y ~ poly(x, degree = 9, raw = TRUE), data = df.of1)
poly.fun <- predict(mod.of2, data.frame(x = seq(1, 10, 0.1)))
pred2 <- predict(mod.of2, newdata = new_data) #mod.of1$fitted.values
p +
geom_line(data = data.frame(x = seq(1, 10, 0.1), y = poly.fun), aes(x = x, y = y),
color = "blue", size = 1) +
geom_point(data = df.of2, aes(x = x, y = y), shape = 0, color = "red") +
geom_segment(data = df.of2, aes(xend = x, yend = pred2)) +
geom_point(size = 3) +
xlim(c(1, 10))
We can compute the root mean square (RMS) error for each of the two models. The RMS is calculated as the square root of the mean of the squared differences between two sets of values (in this case, the predictions of the model and the new information); it measures the typical deviation between the two sets. Given new information, the RMS tells us the expected size of the error when making a prediction with a given model.
The RMS for model 1 is:
sqrt(mean((df.of2$y - pred1)^2))
[1] 0.525595
And for model 2:
sqrt(mean((df.of2$y - pred2)^2))
[1] 1.681143
You will notice how model 2, despite fitting the estimation data better than model 1, typically produces larger errors when new information becomes available.
Another consequence of overfitting is that the resulting functions tend to display extreme behavior when taken outside of their estimation range, where the largest polynomial terms tend to dominate.
The plot below is the same high degree polynomial estimated above, just plotted in a slightly larger range of plus/minus one unit:
poly.fun <- predict(mod.of2, data.frame(x = seq(0, 11, 0.1)))
p +
geom_line(data = data.frame(x = seq(0, 11, 0.1), y = poly.fun), aes(x = x, y = y),
color = "blue", size = 1) +
geom_point(data = df.of2, aes(x = x, y = y), shape = 0, color = "red") +
geom_segment(data = df.of2, aes(xend = x, yend = pred2)) +
geom_point(size = 3)
Another way to generate flexible functional forms is by means of models with spatially varying coefficients. Two approaches are reviewed here.
The expansion method (Casetti, 1972) is an approach to generate models with contextual effects. It follows a philosophy of specifying first a substantive model with variables of interest, and then an expanded model with contextual variables. In geographical analysis, typically the contextual variables are trend surfaces estimated using the coordinates of the observations.
To illustrate this, suppose that there is the following initial model of the proportion of donors in a population, with two variables of substantive interest (say, income \(I\) and education \(Ed\)): \[ d_i = \beta_0(x_i,y_i) + \beta_1(x_i,y_i)I_i + \beta_2(x_i,y_i)Ed_i + \epsilon_i \]
Note how the coefficients are now a function of the coordinates at \(i\). Unlike previous models that had global coefficients, the coefficients in this model are allowed to adapt by location.
Unfortunately, it is not possible to estimate one coefficient per location. In this case, there are \(n\times k\) coefficients, which exceeds the size of the sample (\(n\)); a sample cannot support the estimation of more parameters than it has observations (this is called the incidental parameter problem).
A possible solution is to specify a function for the coefficients, for instance, by specifying a trend surface for them: \[ \begin{array}{l} \beta_0(x_i, y_i) = \beta_{01} +\beta_{02}x_i + \beta_{03}y_i\\ \beta_1(x_i, y_i) = \beta_{11} +\beta_{12}x_i + \beta_{13}y_i\\ \beta_2(x_i, y_i) = \beta_{21} +\beta_{22}x_i + \beta_{23}y_i \end{array} \] By specifying the coefficients as a function of the coordinates, we allow them to vary by location.
Next, if we substitute these coefficients in the initial model, we arrive at a final expanded model: \[ d_i = \beta_{01} +\beta_{02}x_i + \beta_{03}y_i + \beta_{11}I_i +\beta_{12}x_iI_i + \beta_{13}y_iI_i + \beta_{21}Ed_i +\beta_{22}x_iEd_i + \beta_{23}y_iEd_i + \epsilon_i \]
This model now has nine coefficients, instead of \(n\times 3\), and can be estimated as usual.
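A minimal sketch of the estimation follows, using simulated data (the variables d, Inc, and Ed are hypothetical stand-ins for the donors, income, and education variables of the example; only the structure of the formula matters here):

```r
# Sketch: estimating an expanded model by ordinary least squares
# (simulated data; d, Inc, and Ed are hypothetical illustrations)
set.seed(1)
n <- 100
df.exp <- data.frame(x = runif(n), y = runif(n),
                     Inc = runif(n), Ed = runif(n))
# Simulate a process in which the coefficients vary with the coordinates
df.exp$d <- with(df.exp,
                 (1 + 0.5 * x) * Inc + (2 - 0.5 * y) * Ed + rnorm(n, sd = 0.1))
# Expanding each coefficient as a first-degree trend surface introduces
# interaction terms between the coordinates and the covariates
model.exp <- lm(d ~ x + y + Inc + Inc:x + Inc:y + Ed + Ed:x + Ed:y,
                data = df.exp)
summary(model.exp)
```

Note how the nine terms of the formula correspond one-to-one to the nine coefficients of the expanded model.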
It is important to note that since models generated based on the expansion method are based on the use of trend surfaces, similar caveats apply with respect to multicollinearity and overfitting.
A different strategy to estimate models with spatially-varying coefficients is a semi-parametric approach, called geographically weighted regression (see Brunsdon et al., 1996).
Instead of selecting a functional form for the coefficients as the expansion method does, the functions are left unspecified. The spatial variation of the coefficients results from an estimation strategy that takes subsamples of the data in a systematic way.
If you recall kernel density analysis, a kernel was a way of weighting observations based on their distance from a focal point.
Geographically weighted regression applies a similar concept, with a moving window that visits a focal point and estimates a weighted least squares model at that location. The results of the regression are conventionally applied to the focal point, in such a way that not only the coefficients are localized, but also every other regression diagnostic (e.g., the coefficient of determination, the standard deviation, etc.)
A key aspect of implementing this model is the selection of the kernel bandwidth, that is, the size of the window. If the window is too large, the local models tend towards the global model (estimated using the whole sample). If the window is too small, the model tends to overfit, since in the limit each window will contain only one, or a very small number of observations.
The kernel bandwidth can be selected if we define some loss function to minimize. A conventional approach (but not the only one), is to minimize a cross-validation score of the following form: \[ CV (\delta) = \sum_{i=1}^n{\big(y_i - \hat{y}_{\neq i}(\delta)\big)^2} \] In this notation, \(\delta\) is the bandwidth, and \(\hat{y}_{\neq i}(\delta)\) is the value of \(y\) predicted by a model with a bandwidth of \(\delta\) after excluding the observation at \(i\). This is called a leave-one-out cross-validation procedure, used to prevent the estimation from shrinking the bandwidth to zero.
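The mechanics of a single step can be sketched with simulated data: one weighted least squares fit at one focal point, with weights given by a Gaussian kernel (the bandwidth here is an arbitrary assumption; the spgwr package, used below, automates this at every location and selects the bandwidth by cross-validation):

```r
# Sketch: one local regression of geographically weighted regression
# (simulated data; the bandwidth is an arbitrary choice for illustration)
set.seed(42)
n <- 200
df.gwr <- data.frame(x = runif(n), y = runif(n))
df.gwr$z <- exp(1) * exp(2 * df.gwr$x) * exp(4 * df.gwr$y) + rnorm(n)
focal <- df.gwr[1, c("x", "y")]               # focal point: first observation
d <- sqrt((df.gwr$x - focal$x)^2 + (df.gwr$y - focal$y)^2)
bw <- 0.2                                     # assumed kernel bandwidth
w <- exp(-0.5 * (d / bw)^2)                   # Gaussian kernel weights
local.fit <- lm(z ~ x + y, data = df.gwr, weights = w)
coef(local.fit)                               # local coefficients at this point
```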
GWR is implemented in the package spgwr. To estimate models using this approach, first select a bandwidth with the function gwr.sel, which takes as inputs a formula specifying the dependent and independent variables, a SpatialPolygonsDataFrame (or a SpatialPointsDataFrame), and the kernel function (in the example below, a Gaussian kernel):
delta <- gwr.sel(formula = z ~ x + y, data = Hamilton_TAZ, gweight = gwr.Gauss)
Bandwidth: 25.59084 CV score: 399.3492
Bandwidth: 41.36552 CV score: 418.0564
Bandwidth: 15.84156 CV score: 363.7496
Bandwidth: 9.816173 CV score: 323.2672
Bandwidth: 6.092278 CV score: 300.7067
Bandwidth: 3.790784 CV score: 306.2781
Bandwidth: 5.801654 CV score: 299.5063
Bandwidth: 5.313051 CV score: 298.1325
Bandwidth: 4.731597 CV score: 298.2069
Bandwidth: 5.045611 CV score: 297.8656
Bandwidth: 5.040167 CV score: 297.8649
Bandwidth: 5.020481 CV score: 297.8638
Bandwidth: 5.022892 CV score: 297.8638
Bandwidth: 5.022961 CV score: 297.8638
Bandwidth: 5.023002 CV score: 297.8638
Bandwidth: 5.022961 CV score: 297.8638
The function gwr estimates the suite of local models given a bandwidth:
model.gwr <- gwr(formula = z ~ x + y, bandwidth = delta, data = Hamilton_TAZ, gweight = gwr.Gauss)
model.gwr
Call:
gwr(formula = z ~ x + y, data = Hamilton_TAZ, bandwidth = delta,
gweight = gwr.Gauss)
Kernel function: gwr.Gauss
Fixed bandwidth: 5.022961
Summary of GWR coefficient estimates at data points:
Min. 1st Qu. Median 3rd Qu. Max. Global
X.Intercept. -19.8637 -6.3578 -2.6342 -1.3318 1.3375 -4.248
x 7.4674 17.7597 19.7687 25.1595 38.9585 21.948
y 21.7465 33.2582 38.5066 48.8746 96.7801 46.299
The results are given for each location where a local regression was estimated. Let's append them to our tidy dataframe for plotting:
Hamilton_TAZ.t <- left_join(Hamilton_TAZ.t,
data.frame(GTA06 = Hamilton_TAZ$GTA06, model.gwr$SDF@data))
Joining, by = "GTA06"
Column `GTA06` joining character vector and factor, coercing into character vector
Hamilton_TAZ.t <- rename(Hamilton_TAZ.t, beta0 = X.Intercept., beta1 = x, beta2 = y)
The results can be mapped as shown below (try mapping beta1, beta2, localR2, or the residuals gwr.e):
ggplot(data = Hamilton_TAZ.t, aes(x = long, y = lat, group = group,
fill = beta0)) +
geom_polygon(color = "white") +
scale_fill_distiller(palette = "YlOrRd", trans = "reverse") +
coord_equal()
You can verify that the residuals are not spatially autocorrelated:
moran.test(model.gwr$SDF$gwr.e, Hamilton_TAZ.w)
Moran I test under randomisation
data: model.gwr$SDF$gwr.e
weights: Hamilton_TAZ.w
Moran I statistic standard deviate = -0.033896, p-value = 0.5135
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
-0.004534922 -0.003378378 0.001164212
A few caveats with respect to GWR are in order.
Since estimation requires the selection of a kernel bandwidth, and selecting a bandwidth requires estimating many leave-one-out regressions, GWR can be computationally quite demanding, especially for large datasets.
GWR has become a very popular method. However, there is conflicting evidence regarding its ability to retrieve a known spatial process, so the spatially-varying coefficients must be interpreted with a grain of salt. This seems to be less of a concern with larger samples, but at the moment it is not known how large a sample is safe (and larger samples also become computationally more demanding). As well, the estimation method is known to be sensitive to unusual observations. At the moment, I recommend that GWR be used for prediction only; in this respect it seems to perform as well as, or even better than, alternative approaches.
A model that can be used to take direct remedial action with respect to residual spatial autocorrelation is the spatial error model.
This model is specified as follows: \[ y_i = \beta_0 + \sum_{j=1}^k{\beta_jx_{ij}} + \epsilon_i \]
However, it is no longer assumed that the residuals \(\epsilon\) are independent; instead, they display a spatial pattern, in the shape of a moving average: \[ \epsilon_i = \lambda\sum_{j=1}^n{w_{ij}^{st}\epsilon_j} + \mu_i \]
A second set of residuals \(\mu\) is assumed to be independent.
It is possible to show that this model is no longer linear in the coefficients (but this would require a little bit of matrix algebra). For this reason, ordinary least squares is no longer an appropriate estimation algorithm, and models of this kind are instead estimated based on maximum likelihood.
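The matrix algebra step can be sketched as follows (a standard result; \(W\) is the matrix of spatial weights and \(I\) the identity matrix). Writing the residual process in matrix form: \[ \epsilon = \lambda W\epsilon + \mu \quad \Rightarrow \quad \epsilon = (I - \lambda W)^{-1}\mu \] so that the model becomes: \[ y = X\beta + (I - \lambda W)^{-1}\mu \] The coefficient \(\lambda\) enters through a matrix inverse, which is the source of the non-linearity.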
Spatial error models are implemented in the package spdep.
As a remedial model, it can account for a misspecified functional form. We know that the underlying process is not linear, but we specify a linear relationship between \(z\) and the covariates, in the form \(z = \beta_0 + \beta_1x + \beta_2y\):
model.sem1 <- errorsarlm(formula = z ~ x + y,
data = Hamilton_TAZ@data,
listw = Hamilton_TAZ.w)
summary(model.sem1)
Call:errorsarlm(formula = z ~ x + y, data = Hamilton_TAZ@data, listw = Hamilton_TAZ.w)
Residuals:
Min 1Q Median 3Q Max
-2.835717 -0.844457 0.028902 0.778175 2.571815
Type: error
Coefficients: (asymptotic standard errors)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.3968 0.5859 -7.5044 6.173e-14
x 21.9446 1.2605 17.4092 < 2.2e-16
y 47.2179 1.8584 25.4077 < 2.2e-16
Lambda: 0.55022, LR test value: 59.092, p-value: 1.5099e-14
Asymptotic standard error: 0.066516
z-value: 8.2721, p-value: 2.2204e-16
Wald statistic: 68.427, p-value: < 2.22e-16
Log likelihood: -443.6038 for error model
ML residual variance (sigma squared): 1.0906, (sigma: 1.0443)
Number of observations: 297
Number of parameters estimated: 5
AIC: 897.21, (AIC for lm: 954.3)
The coefficient \(\lambda\) is positive (indicative of positive autocorrelation) and high: about 55% of the moving average of the residuals \(\epsilon\) in the neighborhood of \(i\) contributes to the value of \(\epsilon_i\).
You can verify that the residuals are spatially uncorrelated (note that the alternative is “less” because of the negative sign of Moran’s I coefficient):
moran.test(model.sem1$residuals, Hamilton_TAZ.w, alternative = "less")
Moran I test under randomisation
data: model.sem1$residuals
weights: Hamilton_TAZ.w
Moran I statistic standard deviate = -0.82584, p-value = 0.2044
alternative hypothesis: less
sample estimates:
Moran I statistic Expectation Variance
-0.031552712 -0.003378378 0.001163896
Now consider the case of a missing covariate:
model.sem2 <- errorsarlm(formula = log(z) ~ x,
data = Hamilton_TAZ@data,
listw = Hamilton_TAZ.w)
summary(model.sem2)
Call:errorsarlm(formula = log(z) ~ x, data = Hamilton_TAZ@data, listw = Hamilton_TAZ.w)
Residuals:
Min 1Q Median 3Q Max
-0.4014843 -0.0670803 0.0079365 0.0791249 0.4647178
Type: error
Coefficients: (asymptotic standard errors)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.67842 0.18291 9.1763 < 2.2e-16
x 1.92656 0.48773 3.9500 7.814e-05
Lambda: 0.91625, LR test value: 469.65, p-value: < 2.22e-16
Asymptotic standard error: 0.02273
z-value: 40.31, p-value: < 2.22e-16
Wald statistic: 1624.9, p-value: < 2.22e-16
Log likelihood: 177.2607 for error model
ML residual variance (sigma squared): 0.013849, (sigma: 0.11768)
Number of observations: 297
Number of parameters estimated: 4
AIC: -346.52, (AIC for lm: 121.13)
In this case, the residual pattern is particularly strong, with more than 90% of the moving average of the residuals contributing to the value of \(\epsilon_i\).
Alas, the remedial action falls short of cleaning the residuals, and we can see that they remain spatially correlated:
moran.test(model.sem2$residuals, Hamilton_TAZ.w, alternative = "less")
Moran I test under randomisation
data: model.sem2$residuals
weights: Hamilton_TAZ.w
Moran I statistic standard deviate = -3.3267, p-value = 0.0004395
alternative hypothesis: less
sample estimates:
Moran I statistic Expectation Variance
-0.116544541 -0.003378378 0.001157221
This would suggest the need for alternative action (such as the search for additional covariates).
Ideally, a model should be well-specified, and remedial action should be undertaken only when other alternatives have been exhausted.
This concludes Practice 15.